The Analysis


The Analysis on the cleaned data

In [1]:
import pandas as pd
from textblob import TextBlob
from wordcloud import WordCloud
from collections import Counter
import json
In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly import graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
In [3]:
df = pd.read_csv('Data/cleaned_data.csv')
df.drop('Unnamed: 0', axis=1, inplace=True)
df.head()
Out[3]:
created_at id full_text retweet_count favorite_count hashtags clean_text
0 2020-09-15 04:54:38 1305731732453208064 RT @CDCgov: The latest CDC #COVIDView report s... 160 0 ['COVIDView', 'COVID19'] latest cdc covidview reposhows percentage er v...
1 2020-09-15 08:19:54 1305783389044109313 RT @drvox: Whistleblower from ICE detention fa... 12571 0 [] whistleblower ice detention facility files com...
2 2020-09-15 11:18:30 1305828335302344706 RT @CattHarmony: A federal judge ruled PA gove... 400 0 [] afederal judge ruled pa governors coronavirus ...
3 2020-09-15 12:03:58 1305839776994598912 RT @haveigotnews: ‘I would call the police on ... 8743 0 [] police people flouting covid rules says woman ...
4 2020-09-15 13:48:06 1305865981005385728 RT @robbysoave: This sounds too good to be tru... 380 0 [] sounds good true assume walked qualified thous...
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001342 entries, 0 to 1001341
Data columns (total 7 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   created_at      1001342 non-null  object
 1   id              1001342 non-null  int64 
 2   full_text       1001342 non-null  object
 3   retweet_count   1001342 non-null  int64 
 4   favorite_count  1001342 non-null  int64 
 5   hashtags        1001342 non-null  object
 6   clean_text      1000963 non-null  object
dtypes: int64(3), object(4)
memory usage: 53.5+ MB

Date based Analysis

In [5]:
df['created_at'] = pd.to_datetime(df['created_at'])
In [6]:
df.head()
Out[6]:
created_at id full_text retweet_count favorite_count hashtags clean_text
0 2020-09-15 04:54:38 1305731732453208064 RT @CDCgov: The latest CDC #COVIDView report s... 160 0 ['COVIDView', 'COVID19'] latest cdc covidview reposhows percentage er v...
1 2020-09-15 08:19:54 1305783389044109313 RT @drvox: Whistleblower from ICE detention fa... 12571 0 [] whistleblower ice detention facility files com...
2 2020-09-15 11:18:30 1305828335302344706 RT @CattHarmony: A federal judge ruled PA gove... 400 0 [] afederal judge ruled pa governors coronavirus ...
3 2020-09-15 12:03:58 1305839776994598912 RT @haveigotnews: ‘I would call the police on ... 8743 0 [] police people flouting covid rules says woman ...
4 2020-09-15 13:48:06 1305865981005385728 RT @robbysoave: This sounds too good to be tru... 380 0 [] sounds good true assume walked qualified thous...
In [7]:
import warnings
warnings.filterwarnings("ignore")
In [8]:
temp = df[['created_at']]
temp['created_at'] = temp['created_at'].dt.date
temp = temp['created_at'].value_counts().reset_index()
temp.sort_values(by='index', inplace=True)
temp.head()
Out[8]:
index created_at
4 2020-09-15 146685
2 2020-09-16 162102
1 2020-09-17 181793
0 2020-09-18 216600
5 2020-09-19 134390
In [9]:
fig = plt.figure(figsize=(20,10))
ax = sns.barplot(x="index", y="created_at", hue="index", data=temp, dodge=False)

Top 5 Retweeted Tweet

In [10]:
df.dropna(subset=['clean_text'], inplace=True)
df.drop_duplicates('full_text', inplace=True)
In [11]:
fav = df[['retweet_count','full_text']].sort_values('retweet_count',ascending = False)[:5].reset_index()
for i in range(5):
    print('{}). {} Counts\n==> {}\n'.format(i+1, fav['retweet_count'][i], fav['full_text'][i]))
1). 346507 Counts
==> RT @baeonda: I’m 22 years old and I tested positive for COVID-19. 

I’ve been debating on posting, but I want to share my experience especi…

2). 314167 Counts
==> RT @elonmusk: The coronavirus panic is dumb

3). 269812 Counts
==> RT @JoeBiden: I’m Joe Biden and I approve this message. https://t.co/TuRZXPE5xK

4). 243722 Counts
==> RT @BrynnTannehill: I don't think anything I've seen so perfectly captures why there's no way the US is going to be getting on top of COVID…

5). 240917 Counts
==> RT @DiageoLiam: The World Health Organization has announced that dogs cannot contract Covid-19. Dogs previously held in quarantine can now…

Top 5 Favorite Tweets

In [12]:
fav = df[['favorite_count','full_text']].sort_values('favorite_count',ascending = False)[:5].reset_index()
for i in range(5):
    print('{}). {} Counts\n==> {}\n'.format(i+1, fav['favorite_count'][i], fav['full_text'][i]))
1). 174725 Counts
==> Mitch McConnell is trying to pass a Supreme Court nomination before passing a Coronavirus bill.

Tells you everything you need to know.

2). 140741 Counts
==> The USPS had a plan to send 5 reusable facemarks to every household in early April. Even had a press release ready.

The White House blocked the plan.

 “There was concern...that households receiving masks might create concern or panic." https://t.co/pYABjdzTCM https://t.co/v4BLKRMPOc

3). 108283 Counts
==> 7 months of a pandemic and i haven't lost a family member or friend to Covid and I'm grateful ❤

4). 97016 Counts
==> Why would the president ever — ever — when talking about American COVID deaths, start a sentence with, "If you take the blue states out..."?

5). 89190 Counts
==> Biden is doing very, very well so far at this @CNN townhall. He is strongest as a retail politician, interacting with average Americans (which I have missed in covid). He's also incredibly empathetic to struggle and pain. If he brings this energy to the debates, Trump is toast.

In [13]:
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity

def getPolarity(text):
    return TextBlob(text).sentiment.polarity
In [14]:
def analyseSentiment(score):
    if score < 0:
        return 'Negative'
    elif score ==0:
        return 'Neutral'
    else:
        return 'Positive'
In [15]:
df['Subjectivity'] = df['clean_text'].apply(getSubjectivity)
df['Polarity'] = df['clean_text'].apply(getPolarity)
df['Sentiment'] = df['Polarity'].apply(analyseSentiment)
In [16]:
temp = df.groupby('Sentiment').count()['clean_text'].reset_index().sort_values(by='clean_text',ascending=False)
temp.style.background_gradient(cmap='Purples')
Out[16]:
Sentiment clean_text
1 Neutral 37261
2 Positive 28198
0 Negative 15157
In [17]:
fig = px.bar(temp,
             x='Sentiment', y='clean_text',
             title='Sentiment Analysis',
             labels={'Sentiment':'Sentimetns', 'clean_text':'Tweets count'}
            )
fig.show()
In [18]:
fig = go.Figure(go.Funnelarea(
    text =temp.Sentiment,
    values = temp.clean_text,
    title = {"position": "top center", "text": "Funnel-Chart of Sentiment Distribution"}
    ))
fig.show()
In [19]:
plt.figure(1, figsize=(10,6))
plt.hist(df["created_at"],bins = 100);
plt.xlabel('Hours',size = 15)
plt.ylabel('No. of Tweets',size = 15)
plt.title('No. of Tweets per Hour',size = 15)
Out[19]:
Text(0.5, 1.0, 'No. of Tweets per Hour')
In [20]:
plt.figure(figsize=(10,6))
sns.distplot(df['Polarity'], bins=30)
plt.title('Sentiment Distribution',size = 15)
plt.xlabel('Polarity',size = 15)
plt.ylabel('Frequency',size = 15)
plt.show();

Word Count

In [21]:
words = []
words = [word for i in df.clean_text for word in i.split()]
freq = Counter(words).most_common(30)
freq = pd.DataFrame(freq)
freq.columns = ['word', 'frequency']
freq.head()
Out[21]:
word frequency
0 covid 19368
1 coronavirus 14002
2 people 6042
3 new 5795
4 trump 5639
In [22]:
plt.figure(figsize = (15, 10))
sns.barplot(y="word", x="frequency",data=freq)
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x231e1825bc8>
In [23]:
neg_df = df[df['Sentiment'] == 'Negative']
neu_df = df[df['Sentiment'] == 'Neutral']
pos_df = df[df['Sentiment'] == 'Positive']
In [24]:
allWords = ''.join([twts for twts in df['clean_text']])
wordCloud = WordCloud(width=1000, height=700, background_color="white", random_state=21, max_font_size=119).generate(allWords)
wordCloud.to_file('All_Data.png')
plt.figure(num=None, figsize=(25, 8), dpi=180, edgecolor='k')
plt.imshow(wordCloud, interpolation="bilinear")
plt.axis('off')
plt.show()
In [25]:
allWords = ''.join([twts for twts in neg_df['clean_text']])
wordCloud = WordCloud(width=1000, height=700, background_color="white", random_state=21, max_font_size=119).generate(allWords)
wordCloud.to_file('ngwords_wordcloud.png')
plt.figure(num=None, figsize=(25, 8), dpi=180, edgecolor='k')
plt.imshow(wordCloud, interpolation="bilinear")
plt.axis('off')
plt.show()
In [26]:
allWords = ''.join([twts for twts in neu_df['clean_text']])
wordCloud = WordCloud(width=1000, height=700, background_color="white", random_state=21, max_font_size=119).generate(allWords)
wordCloud.to_file('neuwords_wordcloud.png')
plt.figure(num=None, figsize=(25, 8), dpi=180, edgecolor='k')
plt.imshow(wordCloud, interpolation="bilinear")
plt.axis('off')
plt.show()
In [27]:
allWords = ''.join([twts for twts in pos_df['clean_text']])
wordCloud = WordCloud(width=1000, height=700, background_color="white", random_state=21, max_font_size=119).generate(allWords)
wordCloud.to_file('poswords_wordcloud.png')
plt.figure(num=None, figsize=(25, 8), dpi=180, edgecolor='k')
plt.imshow(wordCloud, interpolation="bilinear")
plt.axis('off')
plt.show()
In [28]:
df_covid = pd.read_csv('Data/covid_cases.csv')
df_covid.drop('Unnamed: 0', axis=1, inplace=True)
In [29]:
start_date = df['created_at'][0]
end_date = df['created_at'][len(df)-1]
In [30]:
df_covid['Date'] = pd.to_datetime(df_covid['Date'])
In [31]:
mask = (df_covid['Date'] >= start_date) & (df_covid['Date'] <= end_date)
covid_cases = df_covid.loc[mask]
len(covid_cases)
Out[31]:
11862
In [32]:
temp = covid_cases.groupby('Date').sum()
temp.reset_index(inplace=True)
In [33]:
fig = go.Figure(data=[
    go.Bar(x=temp['Date'], y=temp['Confirmed'], name='Confirmed Cases'),
    go.Bar(x=temp['Date'], y=temp['Recovered'], name='Recovered Cases'),
    go.Bar(x=temp['Date'], y=temp['Deaths'], name='Deaths Cases')
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()
In [34]:
temp = covid_cases.groupby('Country/Region').sum()
temp.reset_index(inplace=True)

top_confirmed_cases = temp.sort_values(by='Confirmed', ascending=False)
top_recovered_Cases = temp.sort_values(by='Recovered', ascending=False)
top_deaths_cases = temp.sort_values(by='Deaths', ascending=False)
In [35]:
fig = make_subplots(rows=1, cols=3,
                    shared_yaxes=True,
                    horizontal_spacing = 0.01,
                    subplot_titles=('Top 5 Countries in Confirmed',
                                    'Top 5 Countries in Recovered',
                                    'Top 5 Countries in Deaths'))

fig.add_trace(go.Bar(x=top_confirmed_cases['Country/Region'][:6],
                     y=top_confirmed_cases['Confirmed'][:6],
                     marker=dict(color=[4, 5, 6], coloraxis="coloraxis")),
              1, 1)

fig.add_trace(go.Bar(x=top_recovered_Cases['Country/Region'][:6],
                     y=top_recovered_Cases['Recovered'][:6],
                    marker=dict(color=[2, 3, 5], coloraxis="coloraxis")),
              1, 2)

fig.add_trace(go.Bar(x=top_deaths_cases['Country/Region'][:6],
                     y=top_deaths_cases['Deaths'][:6],
                    marker=dict(color=[2, 3, 5], coloraxis="coloraxis")),
              1, 3)

fig.update_layout(coloraxis=dict(colorscale='Bluered_r'), showlegend=False)
fig.show()

Hashtags

In [36]:
hashtags = []
In [37]:
for i in df['hashtags']:
    temp = i.replace('\'', "").strip('][').split(', ')
    if '' not in temp:
        hashtags.extend(temp)
In [38]:
temp = pd.DataFrame.from_dict({
    'Hashtags':hashtags,
})
In [39]:
temp.head()
Out[39]:
Hashtags
0 COVIDView
1 COVID19
2 PMNarendraModiBirthday
3 Bi
4 COVID__19
In [40]:
temp = temp['Hashtags'].value_counts().reset_index()
fig = plt.figure(figsize=(20,10))
ax = sns.barplot(x="index", y="Hashtags", hue="index", data=temp[:10], dodge=False)

Saving Analyzed Data


In [41]:
df.head()
Out[41]:
created_at id full_text retweet_count favorite_count hashtags clean_text Subjectivity Polarity Sentiment
0 2020-09-15 04:54:38 1305731732453208064 RT @CDCgov: The latest CDC #COVIDView report s... 160 0 ['COVIDView', 'COVID19'] latest cdc covidview reposhows percentage er v... 0.800 0.050 Positive
1 2020-09-15 08:19:54 1305783389044109313 RT @drvox: Whistleblower from ICE detention fa... 12571 0 [] whistleblower ice detention facility files com... 0.100 -0.150 Negative
2 2020-09-15 11:18:30 1305828335302344706 RT @CattHarmony: A federal judge ruled PA gove... 400 0 [] afederal judge ruled pa governors coronavirus ... 0.000 0.000 Neutral
3 2020-09-15 12:03:58 1305839776994598912 RT @haveigotnews: ‘I would call the police on ... 8743 0 [] police people flouting covid rules says woman ... 0.000 0.000 Neutral
4 2020-09-15 13:48:06 1305865981005385728 RT @robbysoave: This sounds too good to be tru... 380 0 [] sounds good true assume walked qualified thous... 0.625 0.525 Positive
In [42]:
covid_cases.head()
Out[42]:
Country/Region Province/State Latitude Longitude Confirmed Recovered Deaths Date
0 Afghanistan NaN 33.939110 67.709953 38855.0 32503.0 1436.0 2020-09-16
1 US North Carolina 36.060929 -79.121679 2543.0 0.0 54.0 2020-09-16
2 US North Carolina 35.152533 -76.665598 224.0 0.0 2.0 2020-09-16
3 US North Carolina 36.267238 -76.251348 631.0 0.0 29.0 2020-09-16
4 US North Carolina 34.522656 -77.903521 847.0 0.0 5.0 2020-09-16
In [43]:
df.to_csv('data/analysed_data.csv')
covid_cases.to_csv('data/covid_stats.csv')